DNA hybridization to mismatched templates: a chip study 
High-density oligonucleotide arrays are among the most rapidly expanding technologies in biology 
today. In the GeneChip system, the reconstruction of the target concentration depends upon the 
differential signal generated from hybridizing the target RNA to two nearly identical templates: a 
perfect match (PM) and a single mismatch (MM) probe. It has been observed that a large fraction 
of MM probes repeatably bind targets better than the PMs, against the usual expectation from 
sequence-specific hybridization; this is difficult to interpret in terms of the underlying physics. We 
examine this problem via a statistical analysis of a large set of microarray experiments. We classify 
the probes according to their signal to noise (S/N) ratio, defined as the eccentricity of a (PM, MM) 
pair's 'trajectory' across many experiments. Of those probes having large S/N (> 3) only a fraction 
behave consistently with the commonly assumed hybridization model. Our results imply that the 
physics of DNA hybridization in microarrays is more complex than expected, and they suggest new 
ways of constructing estimators for the target RNA concentration. 

PACS numbers: 87.15.-v, 82.39.-k, 82.39.Pj 
Interest in the detailed physics of DNA hybridization is rooted in both purely theoretical
and practical reasons. Studies of the denaturing transition started with models of perfectly
homogeneous DNA, soon followed by studies of sequence-specific disorder. The specificity
with which DNA binds to its exact complement as opposed to a mismatched copy (a "defect")
has been studied both experimentally and theoretically. In this context it has been found
that a fair fraction of the energetics of DNA hybridization is related to stacking
interactions between first-neighbor bases, in addition to the obvious strand-strand contact.
In this paper we present a study of mismatch hybridization stemming from a very practical
problem, hybridization in DNA microarrays. We shall show experimental evidence that the
system displays behavior that is hard to account for on the basis of the extant view of
hybridization specificities.

DNA microarrays provide an experimental technique for measuring thousands of individual
mRNA concentrations present in a given target mixture. They are made by depositing DNA
oligonucleotide sequences (probes) at specific locations on solid substrates. The probes can
be either pre-made sequences, as in cDNA spotted arrays, or they can be grown in situ, letter
by letter, as in high-density oligonucleotide arrays. The target mRNA is amplified (into
either cDNA or cRNA, depending on the protocol) and the product is labeled fluorescently
before being hybridized onto the array. The spatial distribution of fluorescence is then
measured using a laser, providing estimates for the target concentrations. In GeneChip
arrays, the synthesis of probe sequences by photolithographic techniques requires a number
of different masks and deposition processes per added base, making it impractical to grow
more than a few dozen nucleotides. For such lengths, hybridization specificity is not
expected to be high enough. To solve this conundrum, GeneChip technology is based on a
two-fold approach, involving redundancy and differential signal [12]. First, several
different sequence snippets (each 25 bases long) are used to probe a single transcript;
second, each of these probes comes in two flavors. The perfect match (PM) is perfectly
complementary to a portion of the target sequence, whereas the single mismatch (MM) carries
a substitution to the complementary base at its middle (13th) position. The rationale behind
MM sequences is that they are expected to probe for non-specific hybridization in a manner
that we shall detail below.

In current incarnations of the chips, each gene is probed by 14-20 (PM, MM) pairs (a
probeset), and the task is therefore to reconstruct a single number (the target
concentration) from these 28-40 measurements. There are conceivably many ways in which this
can be done, with various degrees of noise rejection. The standard algorithm provided in the
software suite offers one method. However, as independent experimental techniques for
measuring mRNA concentrations (like northern blots) provided clues that the analysis process
should be improved upon, many researchers attempted to do so; it was then discovered that a
fair number of MM probes consistently report a higher fluorescence signal than their PM
counterparts. This observation is most intriguing because it violates the standard
hybridization model as outlined below. Thus, the notion that the specific binding signal
alone can be obtained as a differential of the PM and MM signals appears to fail in a subset
of the probes.

By carefully examining the statistics of PM-MM pairs, we shall show below that this is not a
matter of a few stray probes. These statistics show that most of the probes misbehave to
various degrees. Only a fraction of them, those having MM > PM, exhibit a flagrant violation
of the basic assumptions; however, these are only the most obvious symptoms of a deeper
problem that affects all probes. Given the number of laboratories that are currently carrying
out such hybridization experiments, squeezing out even a meager extra bit of signal-to-noise
ratio from the data would be very valuable. It has become clear that this will not happen in
the absence of a better understanding of DNA hybridization to slightly mismatched templates.
We shall now attempt the first step toward this goal, which is to characterize the problem.
FIG. 1: Joint probability distribution P(log PM, log MM) 
for two large datasets after background subtraction. (a) 
86 HG-U95A human chips, human blood extracts. (b) 24 
Mu11K/A mouse chips, mouse brain extracts. Notice 
that three obvious features are present in both: the probability 
cloud forks into two lobes at high intensity, and an intense 
"button" lies between the two forks right in the middle of 
the range. Notice that the lower lobe is completely contained 
below the diagonal MM = PM. 

The rationale behind the use of MM probes is contained in the standard hybridization
model [17]:

\[
\begin{aligned}
PM &= I_S + I_{NS} + B \\
MM &= (1 - \alpha) I_S + I_{NS} + B \\
PM - MM &= \alpha I_S
\end{aligned}
\]


TABLE I: Statistics of probe pairs with MM > PM taken 
across a large GeneChip data collection. "% PS with > 1" 
means "percent of probesets with more than one MM > PM 
pair". The yeast chip (last column) is noticeably different and 
better behaved than the other cases. 
# pairs per PS 
chips analyzed 
% MM > PM 
% PS with > 1 
% PS with > 5 
% PS with > 10 

Here PM (MM) is the measured brightness of the PM (MM) probe, $I_S$ is the contribution
from specific complementary binding, $I_{NS}$ is the amount from nonspecific binding,
assumed to be insensitive to the substitution, and $B$ is a background of physical origin,
i.e. the photodetector dark current or light reflections from the scanning process. Then
$\alpha$ is the reduction of specific binding due to the single mismatch. These brightnesses
are related to the quantity of interest (the RNA concentration in the sample) through:


\[
\begin{aligned}
I_S &= k\,[S] \\
I_{NS} &= h\,[NS]
\end{aligned}
\]


where [S] denotes the concentration of target RNA and [NS] the concentration of whatever
mixture contributes to the nonspecific hybridization. The quantities $k$ and $h$ are
probe-dependent specific and nonspecific susceptibilities (possibly concentration dependent);
they include effects such as the areal density of probe, various affinities, and
transcript-length-dependent effects (longer transcripts are likely to carry more
fluorophores, depending on the labeling technique).
While it is no secret that the physics of hybridization is far more complex than this
simplistic model, one could still hope that it would provide an essentially correct picture
of GeneChip hybridizations. To summarize, let us outline the basic assumptions made so far:
(i) non-specific binding is identical in PM and MM, meaning that $I_{NS}$ does not see the
letter change; (ii) $\alpha > 0$; (iii) $k$ and $h$ are identical for PM and MM; (iv) $k$,
$h$ and $\alpha$ are reasonably uniform numbers across a probe set. Notice that (i)+(ii)
imply that PM > MM always (see below).

If PM - MM is not used as such, the background $B$ needs to be subtracted from the
intensities, which can be done in a statistically proper way as described in [16].
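To make the assumptions concrete, the standard model can be written as a few lines of Python. This is only a minimal sketch: the function name `standard_model` and all numerical values (susceptibilities, mismatch penalty, background, concentrations) are illustrative assumptions, not quantities measured in this study.

```python
def standard_model(conc_s, conc_ns, k=1.0, h=0.2, alpha=0.6, background=50.0):
    """Return (PM, MM) brightness under the standard hybridization model.

    All default parameter values are arbitrary illustrations.
    """
    i_s = k * conc_s      # specific signal, I_S = k [S]
    i_ns = h * conc_ns    # nonspecific signal, I_NS = h [NS]
    pm = i_s + i_ns + background
    mm = (1.0 - alpha) * i_s + i_ns + background
    return pm, mm


# Transcript present: the differential PM - MM isolates alpha * I_S.
pm, mm = standard_model(conc_s=1000.0, conc_ns=500.0)
print(pm, mm, pm - mm)

# Transcript absent: PM and MM coincide, so MM > PM can never occur here.
pm0, mm0 = standard_model(conc_s=0.0, conc_ns=500.0)
print(pm0, mm0)
```

Within this model the nonspecific term and the background cancel exactly in PM - MM, which is why any systematic MM > PM observation has to come from a violated assumption rather than from parameter values.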

According to the basic tenets of the standard model, it follows that PM > MM for all probe
pairs if the target RNA extract contains no sequences matching the MM exactly. In reality,
one observes a vast number of probe pairs for which this prediction is violated, and this
behavior repeats consistently over a broad range of conditions. Our experience is that most
practitioners think of this problem in terms of an imperfect adherence to the standard
model, or a bothersome deviation from an otherwise properly behaving norm. In other words,
the problem is usually characterized as "there's a number of probe pairs that don't work and
we don't understand why". We shall now show that this is not so: the MM > PM pairs are so
abundant that we propose the alternate view that the model is simply inadequate for
describing what actually happens, and that we do not understand the basic physics of MM
hybridization. Table I summarizes the statistics for various chip series.

FIG. 2: Histogram of probe center of mass. (a) All probes 
(to be compared with Fig. 1a). (b) Only those probesets with 
eccentricities $e > 3$. (c) The probesets of (b), further 
restricted to large excursions ($\lambda_1 > 0.168$, the top third of all 
probesets). (d) Same as (c) for small excursions ($\lambda_1 < 0.121$, 
the bottom third). Notice that (c) consists of all probe pairs 
with large S/N and large signal, while (d) consists of pairs 
that have large S/N but small signal (bottom third). 

The human HG-U95A chip series, for instance, has 400K probes for 12K different probesets.
Across a wide variety of conditions, we have observed that approximately 30% of all probe
pairs have MM > PM. This figure, by itself enormous, would be easy to dismiss if most of
these pairs were in the low-intensity range, where noise is expected to be relatively higher
and could conceivably be bigger than |PM - MM|, or if they were clustered in a small set of
problematic probesets. Neither is true: 91% of all probesets have at least 1 probe pair with
MM > PM, and still 60% of probesets have 5 such probe pairs out of 16. In addition, the
MM > PM pairs are fairly evenly distributed with respect to brightness (cf. Fig. 1).
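The kind of bookkeeping behind these percentages can be sketched as follows. The array layout (one row per probeset, one column per probe pair, a single chip) and the function name `mm_gt_pm_stats` are assumptions made for illustration; the statistics reported here were accumulated over many chips, and the toy numbers below are not real data.

```python
import numpy as np


def mm_gt_pm_stats(pm, mm, thresholds=(1, 5, 10)):
    """Summarize MM > PM violations for one chip.

    pm, mm: arrays of shape (n_probesets, n_pairs), background-subtracted.
    Returns the percentage of violating pairs and, for each threshold t,
    the percentage of probesets with more than t violating pairs.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    violating = mm > pm                     # boolean mask per probe pair
    per_probeset = violating.sum(axis=1)    # violations within each probeset
    stats = {"pct_pairs": 100.0 * violating.mean()}
    for t in thresholds:
        stats[f"pct_ps_gt_{t}"] = 100.0 * np.mean(per_probeset > t)
    return stats


# Toy example: 2 probesets with 4 pairs each (made-up intensities).
pm_toy = [[10, 20, 30, 40], [5, 5, 5, 5]]
mm_toy = [[12, 25, 10, 10], [1, 1, 1, 9]]
stats = mm_gt_pm_stats(pm_toy, mm_toy, thresholds=(1,))
print(stats)
```

Binning the same mask by brightness would show whether the violations concentrate at low intensity, which is the check that Fig. 1 performs graphically.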

What could conceivably be the source for observing MM > PM? A perplexing extra bit of
information lies in a simple statistic, the joint probability distribution
P(log PM, log MM). According to the standard model,

\[
\frac{PM}{MM} = \frac{I_S + I_{NS} + B}{(1 - \alpha) I_S + I_{NS} + B}.
\]

If $I_S$ dominates over $I_{NS} + B$ then $PM/MM \simeq 1/(1 - \alpha)$, while if $I_S$
vanishes (as when the transcript is just not present in the sample) then $PM/MM \to 1$.
Thus we expect

\[
1 < \frac{PM}{MM} < \frac{1}{1 - \alpha}.
\]


Thus, the standard model predicts that P(log PM, log MM) should be supported in a band,
with lower limit corresponding to the diagonal PM = MM when cross-hybridization dominates,
and with an upper limit given by MM = $(1 - \alpha)$ PM for fully specific binding. Naively
one would further assume that for low brightness most of the signal comes from nonspecific
binding, while for high brightness most would come from specific binding. Fig. 1 shows
something quite otherwise: as brightness increases, the joint probability distribution forks
into two branches. The crest of the lower one lies fully below the MM = PM diagonal.
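The two edges of the predicted band can be checked numerically. The sketch below assumes an illustrative mismatch penalty of 0.5 and arbitrary intensity values; `pm_mm_ratio` is a hypothetical helper name.

```python
def pm_mm_ratio(i_s, i_ns, background, alpha):
    """PM/MM ratio predicted by the standard model."""
    return (i_s + i_ns + background) / ((1.0 - alpha) * i_s + i_ns + background)


ALPHA = 0.5  # illustrative value only

# No specific signal: the pair sits on the diagonal PM = MM.
r_absent = pm_mm_ratio(0.0, 50.0, 50.0, ALPHA)

# Comparable specific and nonspecific signal: strictly inside the band.
r_mixed = pm_mm_ratio(100.0, 50.0, 50.0, ALPHA)

# Specific signal dominates: the ratio approaches 1 / (1 - alpha) from below.
r_specific = pm_mm_ratio(1e9, 50.0, 50.0, ALPHA)

print(r_absent, r_mixed, r_specific)
```

No choice of nonnegative intensities pushes the ratio below 1 in this model, so a branch of the distribution on the MM > PM side of the diagonal cannot be accommodated by tuning the parameters.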

The characteristic shapes of P(log PM, log MM) are likely signatures of sequence-dependent
effects. However, any hypothesis is impossible to verify as the probe sequences are not
released to the public. Nevertheless, there are some obvious suspects. First, the nontrivial
susceptibilities $k$ and $h$ mentioned above depend on the areal density of probe, which is
sequence-dependent by virtue of the varying efficiencies of the lithography process. Second,
nucleic acids need to unstack the single-stranded probes in order to form each new duplex as
they hybridize. Further, stacking energies are extremely sensitive to sequence details, which
might result in large energy barriers. This would translate into kinetic constants that vary
exponentially (à la Arrhenius) in these energies, and would have important consequences
because the hybridization reactions are not carried to full thermodynamic equilibrium.

Given a set of $N$ experiments, further insight can be obtained by following a pair
$P_i = (\log PM_i, \log MM_i)$ with $i = 1, \ldots, N$ across the entire dataset (after
subtracting $B$). Ideally, these points would fall on a curve parametrizable by the mRNA
concentration. In reality, however, the observed patterns range from nearly one-dimensional
to almost circular clouds. To classify probes, we computed the center of mass CM and inertia
tensor $\mathcal{I}$ of the set of points $\{P_i\}$. The positive eigenvalues of
$\mathcal{I}$, $I_1 > I_2$, define the eccentricity $e = \sqrt{I_1/I_2}$ and largest
excursion $\lambda_1 = \sqrt{I_1}$. Pairs with high eccentricities are those carrying high
S/N, whereas $e \sim 1$ characterizes a very noisy probe pair.
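A minimal sketch of this classification for a single probe pair is given below. It assumes that the inertia tensor is the second-moment (covariance) matrix of the centered points, normalized by the number of experiments, and that base-10 logarithms are used; neither convention is specified in the text, and the function name `trajectory_shape` is hypothetical.

```python
import numpy as np


def trajectory_shape(pm, mm):
    """Center of mass, eccentricity and largest excursion of one (PM, MM) pair.

    pm, mm: background-subtracted, positive intensities across N experiments.
    Assumption: the tensor is the covariance of the centered log10 points.
    """
    pts = np.column_stack([np.log10(pm), np.log10(mm)])
    cm = pts.mean(axis=0)
    centered = pts - cm
    tensor = centered.T @ centered / len(pts)
    i2, i1 = np.linalg.eigvalsh(tensor)     # eigenvalues in ascending order
    i2 = max(i2, 0.0)                       # guard against round-off
    ecc = np.sqrt(i1 / i2) if i2 > 0 else np.inf
    return cm, ecc, np.sqrt(i1)


# Toy trajectory: log-points (4,2), (0,2), (2,3), (2,1), elongated along PM.
pm_demo = np.array([1e4, 1.0, 100.0, 100.0])
mm_demo = np.array([100.0, 100.0, 1000.0, 10.0])
cm, ecc, lam1 = trajectory_shape(pm_demo, mm_demo)
print(cm, ecc, lam1)
```

For this toy cloud the spread along the long axis is twice that along the short one, so it falls below a cut such as $e > 3$ and would be classed as noisy.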

Fig. 2 illustrates the distribution of the center of mass after different filtering for $e$
and $\lambda_1$. It turns out that Fig. 2a looks very similar to Fig. 1, which is not a
priori evident; this similarity emphasizes that most probes behave in a very reproducible
manner. For instance, probes lying below the PM = MM diagonal at the high-intensity end do
so in essentially all of the 86 experiments (leading to a CM that is also below the
diagonal), instead of visiting different regions of the plot. Another striking result is
that (i) selecting for $e > 3$ eliminates most of the low-intensity probes (Fig. 2b), and
(ii) the remaining set contains two components: one consisting of the large $\lambda_1$
probes (Fig. 2c), lying mostly in the PM > MM region, and the small $\lambda_1$ component,
forming an almost perfectly symmetric "tulip" structure (Fig. 2d) that contains two forked
branches plus the button described in Fig. 1.

Another troubling effect, which deeply affects attempts at analysis, is the very broad
brightness distribution within probes belonging to the same gene. Fig. 3 shows that the PM
probe intensities span up to four decades. Possible reasons for such behavior are again
sequence-specific effects similar to those discussed in the context of the MM behavior.
FIG. 3: Relative PM intensity distributions within probesets 
(after subtracting B). The data are from the 86 HG-U95A 
human chips used previously. Probesets are split into 
three groups according to their median PM intensity. In all 
cases, the distributions of PM/median(PM) span up to four 
decades. Notice there are signs of saturation in the right tail 
of the high-intensity set. 

The main practical challenge is reconstructing the target mRNA concentration from the
probeset data. As we showed, the variability in the hybridization properties of the probes
is larger than naively anticipated; it is therefore unlikely that a single definitive
procedure will be appropriate in all cases. On the contrary, it is desirable to have several
analysis tools at hand for viewing the data from different angles. For instance, as a
consequence of the strongly probe-dependent susceptibilities $\alpha$, $k$ and $h$, the
differential PM - MM will not consistently be a good estimator of the true signal. Given the
unclear information contained in the MM, one alternative we studied is not considering them
at all. The mRNA expression level is then obtained from a robust geometric average of the
PM - B values, after a careful estimation of B [16]. The use of geometric (rather than
arithmetic) averages is dictated by the distributions in Fig. 3. Of course, using only PM
probes neglects cross-hybridization effects that would be detectable by a working MM probe,
and hence tends to be less sensitive at the low-intensity end. On the other hand, it allows
one to rescue probesets with a high number of misbehaving MMs.
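One possible reading of a "robust geometric average" is a trimmed mean of the log-intensities, sketched below. The trimming fraction, the choice to drop probes with PM - B <= 0, and the name `robust_geometric_mean` are all assumptions for illustration; the estimator actually used is the one described in the cited reference, and B is taken here as already estimated.

```python
import numpy as np


def robust_geometric_mean(pm, background, trim=0.2):
    """Trimmed geometric mean of PM - B over the probes of one probeset.

    Assumptions: probes with PM - B <= 0 are discarded, and a fraction
    `trim` (< 0.5) of the sorted log-values is cut from each tail.
    """
    signal = np.asarray(pm, dtype=float) - background
    logs = np.sort(np.log(signal[signal > 0]))
    if logs.size == 0:
        return float("nan")
    cut = int(trim * logs.size)             # number of values cut per tail
    kept = logs[cut:logs.size - cut]
    return float(np.exp(kept.mean()))


# Toy probeset spanning four decades, as in Fig. 3 (made-up numbers).
pm_values = [110.0, 1010.0, 10010.0, 100010.0, 1000010.0]
estimate = robust_geometric_mean(pm_values, background=10.0)

# Same probeset with one saturated outlier: the estimate is unchanged.
pm_outlier = [110.0, 1010.0, 10010.0, 100010.0, 1e9 + 10.0]
estimate_outlier = robust_geometric_mean(pm_outlier, background=10.0)
print(estimate, estimate_outlier)
```

Averaging in log space keeps a single very bright probe from dominating the estimate, which is the concern raised by intensity distributions that span several decades.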

A completely different approach, closer in spirit to the model-based method, would be to
extend the ellipsoid of inertia idea to the full probeset. Concretely, one would take the
matrix $A_i^j = (\log PM_i^j, \log MM_i^j)$ ($j = 1, \ldots, N_p$ is the probe index and $i$
the experiment index) and do a principal component analysis to identify the modes carrying
the most signal. After the singular value decomposition $\tilde{A} = U \Lambda V^T$, where
$\tilde{A}_i^j = A_i^j - m^j$ and $m^j = \langle A_i^j \rangle_i$ is the center of mass, the
signal $s_i$ is given by the projection of $m^j + \tilde{A}_i^j$, summed over $j$, onto the
largest direction of variation. A signal-to-noise measure for the entire probeset is then
obtained from the singular values. Preliminary testing of the method has led to very
promising results.
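A sketch of this decomposition with NumPy follows. The layout of the matrix (experiments as rows, the $N_p$ log PM columns followed by the $N_p$ log MM columns), the sign convention for the leading direction, and the name `probeset_pca` are assumptions made for illustration; taking the ratio of the two largest singular values as the S/N measure would be one natural choice by analogy with the eccentricity, but it is not confirmed by the text.

```python
import numpy as np


def probeset_pca(log_pm, log_mm):
    """Project a probeset onto its direction of largest variation.

    log_pm, log_mm: arrays of shape (n_experiments, n_probes).
    Returns the per-experiment signal and the singular values.
    """
    a = np.hstack([log_pm, log_mm])         # shape (N, 2 * Np)
    m = a.mean(axis=0)                      # center of mass over experiments
    a_tilde = a - m
    _, sing, vt = np.linalg.svd(a_tilde, full_matrices=False)
    v1 = vt[0]                              # leading right singular vector
    if v1.sum() < 0:                        # fix the arbitrary overall sign
        v1 = -v1
    signal = (m + a_tilde) @ v1             # projection of m + A-tilde
    return signal, sing


# Toy probeset: 4 experiments, 2 probes, MM responding half as strongly as PM.
conc = np.array([0.0, 1.0, 2.0, 3.0])[:, None]
log_pm = 2.0 + conc * np.ones((1, 2))
log_mm = 1.5 + 0.5 * conc * np.ones((1, 2))
signal, sing = probeset_pca(log_pm, log_mm)
print(signal, sing)
```

In this noiseless toy case the centered matrix has rank one, so all variation lies along the leading direction and the signal increases monotonically with the underlying concentration.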

In conclusion, we showed that the hybridization of short DNA sequences to single-mismatched
templates exhibits a far more diverse picture than what is usually assumed. These
observations not only point at interesting physics in the process of DNA hybridization to
short sequences with defects, attached to a glass surface; they also have strong
consequences for designers of GeneChip analysis tools, especially when it comes to the level
of noise rejection of different methods. We hope this will bolster interest in the physics
of hybridization and mismatch characterization.